
smp: fix proxy reconnection to relay after restart #1806

Open: shumvgolove wants to merge 3 commits into master from sh/fix-proxy-reconnect


@shumvgolove

Collaborator

Problem

An SMP proxy permanently stops reconnecting to a destination relay after the relay restarts. The logs show repeated PCEResponseTimeout for that relay, and only restarting the proxy server recovers it.

Cause

A PRXY request makes the proxy open a connection to the relay in a worker forked from the sender's client. The worker inserts an empty session var into smpClients and then blocks in the connection/handshake. If the sender disconnects while that connect is in flight, the worker is killed by an async exception before the session var is ever filled.

Nothing removes an empty session var, so every later request to that relay waits on it until the connection timeout and fails with PROXY (BROKER TIMEOUT) - forever, even once the relay is healthy again.
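The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the proxy's actual code: `Sessions`, `getOrCreate`, `request`, and `demoLeak` are hypothetical names, the map is keyed by a plain `String`, and the handshake is simulated with `threadDelay`. It assumes the `stm` and `containers` packages that ship with GHC.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Concurrent.STM
import qualified Data.Map.Strict as M
import System.Timeout (timeout)

-- Hypothetical stand-in for smpClients: server name -> session var.
type Sessions = TVar (M.Map String (TMVar String))

-- Return the existing var (Right) or insert a new empty one (Left).
-- The caller that gets Left is responsible for connecting and filling it.
getOrCreate :: Sessions -> String -> STM (Either (TMVar String) (TMVar String))
getOrCreate sessions srv = do
  m <- readTVar sessions
  case M.lookup srv m of
    Just v -> pure (Right v)
    Nothing -> do
      v <- newEmptyTMVar
      writeTVar sessions (M.insert srv v m)
      pure (Left v)

-- Unprotected connect: nothing removes the var if this thread is killed
-- between getOrCreate and putTMVar.
request :: Sessions -> String -> Int -> IO (Maybe String)
request sessions srv handshakeMicros = do
  r <- atomically (getOrCreate sessions srv)
  case r of
    -- Someone else is connecting: wait for the var, up to 0.2 s.
    Right v -> timeout 200000 (atomically (readTMVar v))
    Left v -> do
      threadDelay handshakeMicros -- simulated connection/handshake
      atomically (putTMVar v "connected")
      pure (Just "connected")

demoLeak :: IO (Maybe String, Int)
demoLeak = do
  sessions <- newTVarIO M.empty
  worker <- forkIO (() <$ request sessions "relay" 1000000)
  threadDelay 100000 -- let the worker insert the empty var
  killThread worker  -- the sender disconnects mid-connect
  r <- request sessions "relay" 1000 -- a later request finds the stale var
  n <- M.size <$> readTVarIO sessions
  pure (r, n)

main :: IO ()
main = demoLeak >>= print -- prints (Nothing,1)
```

The later request times out even though nothing is connecting any more, and the stale entry is still in the map, so every subsequent request would behave the same way.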

Reproduces the proxy failing to reconnect to a destination relay when the
sender disconnects mid-connection (empty session var left in smpClients).
getSessVar inserts an empty session var that the connect path then fills with
putTMVar. If the connecting thread was killed by an async exception before that
fill (a proxy worker on client disconnect, an agent worker on cancel), the empty
var stayed in the map forever and every later request for that server blocked
on it until timing out (permanent PCEResponseTimeout).

Wrap get-or-create with withGetSessVar (bracketOnError) at the call sites, so the
cleanup is established where the var is created and covers the whole connect: on
interrupt before fill the still-empty var is dropped and the next request
reconnects. This closes the window between getSessVar and the fill that a handler
installed inside the connect function cannot cover.
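The cleanup pattern described above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `withSessVarSketch`, `removeIfEmpty`, `connect`, and `demoFix` are hypothetical names and the real `withGetSessVar` signature may differ. Only `bracketOnError` from `Control.Exception` and the STM primitives are real library calls.

```haskell
import Control.Concurrent (forkFinally, killThread, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
import Control.Exception (bracketOnError)
import Control.Monad (when)
import qualified Data.Map.Strict as M

-- Hypothetical stand-in for smpClients: server name -> session var.
type Sessions = TVar (M.Map String (TMVar String))

-- Return the existing var (Right) or insert a new empty one (Left).
getOrCreate :: Sessions -> String -> STM (Either (TMVar String) (TMVar String))
getOrCreate sessions srv = do
  m <- readTVar sessions
  case M.lookup srv m of
    Just v -> pure (Right v)
    Nothing -> do
      v <- newEmptyTMVar
      writeTVar sessions (M.insert srv v m)
      pure (Left v)

-- Drop the var only if it is still empty and still the one stored in the map.
removeIfEmpty :: Sessions -> String -> TMVar String -> STM ()
removeIfEmpty sessions srv v = do
  stillEmpty <- isEmptyTMVar v
  m <- readTVar sessions
  when (stillEmpty && M.lookup srv m == Just v) $
    writeTVar sessions (M.delete srv m)

-- Get-or-create with the cleanup installed at the point of creation, so it
-- covers the whole action, including the time before the var is filled.
withSessVarSketch :: Sessions -> String -> (Either (TMVar String) (TMVar String) -> IO a) -> IO a
withSessVarSketch sessions srv =
  bracketOnError (atomically (getOrCreate sessions srv)) cleanup
  where
    cleanup (Left v) = atomically (removeIfEmpty sessions srv v)
    cleanup (Right _) = pure () -- not our var; its creator cleans it up

connect :: Sessions -> String -> Int -> IO String
connect sessions srv handshakeMicros =
  withSessVarSketch sessions srv $ \r -> case r of
    Right v -> atomically (readTMVar v)
    Left v -> do
      threadDelay handshakeMicros -- simulated connection/handshake
      atomically (putTMVar v "connected")
      pure "connected"

demoFix :: IO (Int, String)
demoFix = do
  sessions <- newTVarIO M.empty
  done <- newEmptyMVar
  worker <- forkFinally (connect sessions "relay" 1000000) (\_ -> putMVar done ())
  threadDelay 100000 -- let the worker insert the empty var
  killThread worker  -- interrupt the connect before the fill
  takeMVar done      -- wait until the worker's cleanup has run
  n <- M.size <$> readTVarIO sessions
  r <- connect sessions "relay" 10000 -- the next request reconnects
  pure (n, r)

main :: IO ()
main = demoFix >>= print -- prints (0,"connected")
```

Because `bracketOnError` runs the cleanup only when the action exits with an exception, a successfully filled var stays in the map, while an interrupted connect leaves no stale entry behind.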

UtilTests: tryAllErrors rethrows ThreadKilled/StackOverflow (the mechanism
that skips putTMVar). SMPProxyTests: agent client reconnection after a
cancelled connect, plus a control proving the stalling relay alone does not
cause the failure; refine the relay reconnection tests.

@shumvgolove force-pushed the sh/fix-proxy-reconnect branch from 1ddaee7 to 4642b2b on June 26, 2026 10:32